In the same spirit as Tsybakov, we define the optimality of an aggregation procedure in the 
problem of classification. Using an aggregate with exponential weights, we obtain an optimal 
rate of convex aggregation for the hinge risk under the margin assumption. Moreover, we obtain 
an optimal rate of model selection aggregation under the margin assumption for the excess Bayes 
risk. 
1. Introduction 

Let $(\mathcal{X}, \mathcal{A})$ be a measurable space. We consider a random variable $(X, Y)$ on $\mathcal{X} \times \{-1, 1\}$ with probability distribution denoted by $\pi$. Denote by $P^X$ the marginal of $\pi$ on $\mathcal{X}$ and by $\eta(x) \stackrel{\mathrm{def}}{=} P(Y = 1 \mid X = x)$ the conditional probability function of $Y = 1$, knowing that $X = x$. We have $n$ i.i.d. observations of the couple $(X, Y)$ denoted by $D_n = ((X_i, Y_i))_{i=1,\ldots,n}$. The aim is to predict the output label $Y$ for any input $X$ in $\mathcal{X}$ from the observations $D_n$.

We recall some usual notation for the classification framework. A prediction rule is a measurable function $f : \mathcal{X} \to \{-1, 1\}$. The misclassification error associated with $f$ is

$$R(f) = P(Y \ne f(X)).$$

It is well known (see, e.g., Devroye et al. [14]) that

$$\min_{f : \mathcal{X} \to \{-1,1\}} R(f) = R(f^*) \stackrel{\mathrm{def}}{=} R^*,$$

where the prediction rule $f^*$, called the Bayes rule, is defined by

$$f^*(x) \stackrel{\mathrm{def}}{=} \mathrm{sign}(2\eta(x) - 1), \quad \forall x \in \mathcal{X}.$$
The minimal risk $R^*$ is called the Bayes risk. A classifier is a function, $f_n = f_n(X, D_n)$, measurable with respect to $D_n$ and $X$ with values in $\{-1, 1\}$, that assigns to the sample $D_n$ a prediction rule $f_n(\cdot, D_n) : \mathcal{X} \to \{-1, 1\}$. A key characteristic of $f_n$ is the generalization error $E[R(f_n)]$, where

$$R(f_n) = P(Y \ne f_n(X) \mid D_n).$$

The aim of statistical learning is to construct a classifier $f_n$ such that $E[R(f_n)]$ is as close to $R^*$ as possible. Accuracy of a classifier $f_n$ is measured by the value $E[R(f_n) - R^*]$, called the excess Bayes risk of $f_n$. We say that the classifier $f_n$ learns with the convergence rate $\psi(n)$, where $(\psi(n))_{n \in \mathbb{N}}$ is a decreasing sequence, if there exists an absolute constant $C > 0$ such that for any integer $n$, $E[R(f_n) - R^*] \le C\psi(n)$.

Given a convergence rate, Theorem 7.2 of Devroye et al. [14] shows that no classifier can learn at least as fast as this rate for every underlying probability distribution $\pi$. To achieve rates of convergence, we need a complexity assumption on the set to which the Bayes rule $f^*$ belongs. For instance, Yang [36, 37] provides examples of classifiers learning with a given convergence rate under complexity assumptions. These rates cannot be faster than $n^{-1/2}$ (cf. Devroye et al. [14]). Nevertheless, they can be as fast as $n^{-1}$ if we add a control on the behavior of the conditional probability function $\eta$ at the level 1/2 (the distance $|\eta(\cdot) - 1/2|$ is sometimes called the margin). For the problem of discriminant analysis, which is close to our classification problem, Mammen and Tsybakov [25] and Tsybakov [34] have introduced the following assumption.

(MA) Margin (or low noise) assumption. The probability distribution $\pi$ on the space $\mathcal{X} \times \{-1, 1\}$ satisfies MA($\kappa$) with $1 \le \kappa < +\infty$ if there exists $c_0 > 0$ such that

$$E[|f(X) - f^*(X)|] \le c_0 (R(f) - R^*)^{1/\kappa}, \tag{1}$$

for any measurable function $f$ with values in $\{-1, 1\}$.

According to Tsybakov [34] and Boucheron et al. [7], this assumption is equivalent to a control on the margin given by

$$P[|2\eta(X) - 1| \le t] \le c t^{\alpha}, \quad \forall\, 0 \le t < 1.$$

Several examples of fast rates, that is, rates faster than $n^{-1/2}$, can be found in Blanchard et al. [5], Steinwart and Scovel [31, 32], Massart [26], Massart and Nedelec [28], Massart [27] and Audibert and Tsybakov [1].

The paper is organized as follows. In Section 2, we introduce definitions and procedures which are used throughout the paper. Section 3 contains oracle inequalities for our aggregation procedures w.r.t. the excess hinge risk. Section 4 contains similar results for the excess Bayes risk. Proofs are postponed to Section 5. 
2. Definitions and procedures 
2.1. Loss functions 

Convex surrogates $\phi$ for the classification loss are often used in algorithms (Cortes and Vapnik [13], Freund and Schapire [15], Lugosi and Vayatis [24], Friedman et al. [16], Bühlmann and Yu [8], Bartlett et al. [2, 3]). Let us introduce some notation. Take $\phi$ to be a measurable function from $\mathbb{R}$ to $\mathbb{R}$. The risk associated with the loss function $\phi$ is called the $\phi$-risk and is defined by

$$A^{(\phi)}(f) \stackrel{\mathrm{def}}{=} E[\phi(Yf(X))],$$

where $f : \mathcal{X} \to \mathbb{R}$ is a measurable function. The empirical $\phi$-risk is defined by

$$A_n^{(\phi)}(f) \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f(X_i))$$

and we denote by $A^{(\phi)*}$ the infimum over all real-valued functions, $\inf_{f : \mathcal{X} \to \mathbb{R}} A^{(\phi)}(f)$.

Classifiers obtained by minimization of the empirical $\phi$-risk, for different convex losses, have been proven to have very good statistical properties (cf. Lugosi and Vayatis [24], Blanchard et al. [6], Zhang [39], Steinwart and Scovel [31, 32] and Bartlett et al. [3]). A wide variety of classification methods in machine learning are based on this idea, in particular, on using the convex loss $\phi(x) = \max(1 - x, 0)$ associated with support vector machines (Cortes and Vapnik [13], Schölkopf and Smola [30]), called the hinge loss. The corresponding risk is called the hinge risk and is defined by

$$A(f) \stackrel{\mathrm{def}}{=} E[\max(1 - Yf(X), 0)],$$

for any measurable function $f : \mathcal{X} \to \mathbb{R}$. The optimal hinge risk is defined by

$$A^* \stackrel{\mathrm{def}}{=} \inf_{f : \mathcal{X} \to \mathbb{R}} A(f). \tag{2}$$

It is easy to check that the Bayes rule $f^*$ attains the infimum in (2) and that

$$R(f) - R^* \le A(f) - A^*, \tag{3}$$

for any measurable function $f$ with values in $\mathbb{R}$ (cf. Lin [23] and generalizations in Zhang [39] and Bartlett et al. [3]), where we extend the definition of $R$ to the class of real-valued functions by $R(f) = R(\mathrm{sign}(f))$. Thus, minimization of the excess hinge risk, $A(f) - A^*$, provides a reasonable alternative for minimization of the excess Bayes risk, $R(f) - R^*$.
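Inequality (3) can be checked numerically on a finite input space. The sketch below is only an illustration: the helper names (`excess_bayes_risk`, `excess_hinge_risk`, `random_instance`) and the convention sign(0) = 1 are assumptions made here, not part of the paper.

```python
import random

def sign(v):
    # Convention assumed for this sketch: sign(0) = 1.
    return 1 if v >= 0 else -1

def excess_bayes_risk(px, eta, f):
    """R(f) - R* on a finite input space, with R(f) = R(sign(f)).
    px[j] = P(X = x_j), eta[j] = P(Y = 1 | X = x_j), f[j] = f(x_j)."""
    risk = sum(p * ((1 - e) if sign(v) == 1 else e)
               for p, e, v in zip(px, eta, f))
    bayes = sum(p * min(e, 1 - e) for p, e in zip(px, eta))
    return risk - bayes

def excess_hinge_risk(px, eta, f):
    """A(f) - A*, using the fact that the Bayes rule attains A*."""
    def hinge_risk(g):
        return sum(p * (e * max(1 - v, 0.0) + (1 - e) * max(1 + v, 0.0))
                   for p, e, v in zip(px, eta, g))
    f_star = [sign(2 * e - 1) for e in eta]
    return hinge_risk(f) - hinge_risk(f_star)

def random_instance(rng, size=6):
    """Draw a random marginal, regression function and real-valued rule."""
    raw = [rng.random() for _ in range(size)]
    total = sum(raw)
    px = [r / total for r in raw]
    eta = [rng.random() for _ in range(size)]
    f = [rng.uniform(-3.0, 3.0) for _ in range(size)]
    return px, eta, f

if __name__ == "__main__":
    rng = random.Random(0)
    px, eta, f = random_instance(rng)
    print(excess_bayes_risk(px, eta, f), excess_hinge_risk(px, eta, f))
```

On every random instance the first printed value should not exceed the second, which is what (3) asserts.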



Aggregation of classifiers 

2.2. Aggregation procedures 

Now, we introduce the problem of aggregation and the aggregation procedures which will be studied in this paper.

Suppose that we have $M \ge 2$ different classifiers $\hat f_1, \ldots, \hat f_M$ taking values in $\{-1, 1\}$. The problem of model selection type aggregation, as studied in Nemirovski [29], Yang [38], Catoni [10, 11] and Tsybakov [33], consists of the construction of a new classifier $\tilde f_n$ (called an aggregate) which approximately mimics the best classifier among $\hat f_1, \ldots, \hat f_M$. In most of these papers the aggregation is based on splitting the sample into two independent subsamples, $D_m^1$ and $D_l^2$, of sizes $m$ and $l$, respectively, where $m + l = n$. The first subsample, $D_m^1$, is used to construct the classifiers $\hat f_1, \ldots, \hat f_M$ and the second subsample, $D_l^2$, is used to aggregate them, that is, to construct a new classifier that mimics, in a certain sense, the behavior of the best among the classifiers $\hat f_j$, $j = 1, \ldots, M$.

In this paper, we will not consider the sample splitting and will concentrate only on the construction of aggregates (following Juditsky and Nemirovski [18], Tsybakov [33], Birgé [4], Bunea et al. [9]). Thus, the first subsample is fixed and, instead of classifiers $\hat f_1, \ldots, \hat f_M$, we have fixed prediction rules $f_1, \ldots, f_M$. Rather than working with a part of the initial sample we will suppose, for notational simplicity, that the whole sample $D_n$ of size $n$ is used for the aggregation step instead of a subsample $D_l^2$.

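Let $\mathcal{F} = \{f_1, \ldots, f_M\}$ be a finite set of real-valued functions, where $M \ge 2$. An aggregate is a real-valued statistic of the form

$$\tilde f_n = \sum_{f \in \mathcal{F}} w^{(n)}(f) f,$$

where the weights $(w^{(n)}(f))_{f \in \mathcal{F}}$ satisfy

$$w^{(n)}(f) \ge 0 \quad \text{and} \quad \sum_{f \in \mathcal{F}} w^{(n)}(f) = 1.$$

Let $\phi$ be a convex loss for classification. The Empirical Risk Minimization aggregate (ERM) is defined by the weights

$$w^{(n)}(f) = \begin{cases} 1, & \text{for one } f \in \mathcal{F} \text{ such that } A_n^{(\phi)}(f) = \min_{g \in \mathcal{F}} A_n^{(\phi)}(g), \\ 0, & \text{for all other } f \in \mathcal{F}, \end{cases} \quad \forall f \in \mathcal{F}.$$

The ERM aggregate is denoted by $\tilde f_n^{(ERM)}$.

The averaged ERM aggregate is defined by the weights

$$w^{(n)}(f) = \begin{cases} 1/N, & \text{if } A_n^{(\phi)}(f) = \min_{g \in \mathcal{F}} A_n^{(\phi)}(g), \\ 0, & \text{otherwise}, \end{cases} \quad \forall f \in \mathcal{F},$$

where $N$ is the number of functions in $\mathcal{F}$ minimizing the empirical $\phi$-risk. The averaged ERM aggregate is denoted by $\tilde f_n^{(AERM)}$.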
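The Aggregation with Exponential Weights aggregate (AEW) is defined by the weights

$$w^{(n)}(f) = \frac{\exp(-n A_n^{(\phi)}(f))}{\sum_{g \in \mathcal{F}} \exp(-n A_n^{(\phi)}(g))}, \quad \forall f \in \mathcal{F}. \tag{4}$$

The AEW aggregate is denoted by $\tilde f_n^{(AEW)}$.

The cumulative AEW aggregate is an on-line procedure defined by the weights

$$w^{(n)}(f) = \frac{1}{n} \sum_{k=1}^{n} \frac{\exp(-k A_k^{(\phi)}(f))}{\sum_{g \in \mathcal{F}} \exp(-k A_k^{(\phi)}(g))}, \quad \forall f \in \mathcal{F}.$$

The cumulative AEW aggregate is denoted by $\tilde f_n^{(CAEW)}$.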

When $\mathcal{F}$ is a class of prediction rules, intuitively, the AEW aggregate is more robust than the ERM aggregate w.r.t. the problem of overfitting. If the classifier with smallest empirical risk is overfitted, that is, if it fits the observations too closely, then the ERM aggregate will be overfitted. But, if other classifiers in $\mathcal{F}$ are good classifiers, then the aggregate with exponential weights will consider their "opinions" in the final decision procedure and these opinions can balance the opinion of the overfitted classifier in $\mathcal{F}$, which can be false because of its overfitting property. The ERM only considers the "opinion" of the classifier with the smallest empirical risk, whereas the AEW takes into account all of the opinions of the classifiers in the set $\mathcal{F}$.

The exponential weights, defined in (4), can be found in several situations. First, one can check that the solution of the minimization problem

$$\min\Big( \sum_{j=1}^{M} \lambda_j A_n^{(\phi)}(f_j) + \epsilon \sum_{j=1}^{M} \lambda_j \log \lambda_j \;:\; \sum_{j=1}^{M} \lambda_j \le 1,\ \lambda_j \ge 0,\ j = 1, \ldots, M \Big) \tag{5}$$

for all $\epsilon > 0$ is

$$\lambda_j = \frac{\exp(-A_n^{(\phi)}(f_j)/\epsilon)}{\sum_{k=1}^{M} \exp(-A_n^{(\phi)}(f_k)/\epsilon)}, \quad \forall j = 1, \ldots, M.$$

Thus, for $\epsilon = 1/n$, we find the exponential weights used for the AEW aggregate. Second, these weights can also be found in the theory of prediction of individual sequences (cf. Vovk [35]).
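The four weighting schemes of this section can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the loss defaults to the hinge loss, each candidate function is represented by its list of values at the sample points, and the minimum risk is subtracted inside the exponential purely for numerical stability (this does not change the normalized weights).

```python
import math

def hinge(z):
    return max(1.0 - z, 0.0)

def empirical_risk(preds, y, loss=hinge):
    """Empirical phi-risk of one function, given its values preds[i] = f(X_i)."""
    return sum(loss(yi * pi) for yi, pi in zip(y, preds)) / len(y)

def erm_weights(risks):
    """Weight 1 on one empirical risk minimizer (the first one found), 0 elsewhere."""
    best = min(range(len(risks)), key=risks.__getitem__)
    return [1.0 if j == best else 0.0 for j in range(len(risks))]

def aerm_weights(risks, tol=1e-12):
    """Uniform weights 1/N over the N empirical risk minimizers."""
    low = min(risks)
    hits = [r <= low + tol for r in risks]
    count = sum(hits)
    return [1.0 / count if h else 0.0 for h in hits]

def aew_weights(risks, n):
    """Exponential weights proportional to exp(-n * empirical risk), as in (4)."""
    low = min(risks)  # shift for numerical stability only
    unnorm = [math.exp(-n * (r - low)) for r in risks]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def caew_weights(all_preds, y, loss=hinge):
    """Average of the AEW weights computed on the first k observations, k = 1..n."""
    n, m = len(y), len(all_preds)
    acc = [0.0] * m
    for k in range(1, n + 1):
        risks_k = [empirical_risk(p[:k], y[:k], loss) for p in all_preds]
        acc = [a + w for a, w in zip(acc, aew_weights(risks_k, k))]
    return [a / n for a in acc]

if __name__ == "__main__":
    y = [1, -1, 1, 1]
    all_preds = [[1, -1, 1, 1], [1, 1, 1, 1], [-1, 1, -1, -1]]
    risks = [empirical_risk(p, y) for p in all_preds]
    print(erm_weights(risks), aew_weights(risks, len(y)), caew_weights(all_preds, y))
```

In the toy example the first function classifies every observation correctly, so ERM puts all the mass on it, while AEW and CAEW keep a small positive weight on the other two.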



2.3. Optimal rates of aggregation 

Now, we introduce a concept of optimality for an aggregation procedure and for rates of aggregation, in the same spirit as in Tsybakov [33] (where the regression problem is treated). Our aim is to prove that the aggregates introduced above are optimal in the following sense. We denote by $\mathcal{P}_\kappa$ the set of all probability measures $\pi$ on $\mathcal{X} \times \{-1, 1\}$ satisfying MA($\kappa$).
Definition 1. Let $\phi$ be a loss function. The remainder term $\gamma(n, M, \kappa, \mathcal{F}, \pi)$ is called an optimal rate of model selection type aggregation (MS-aggregation) for the $\phi$-risk if the two following inequalities hold:

(i) $\forall \mathcal{F} = \{f_1, \ldots, f_M\}$, there exists a statistic $\tilde f_n$, depending on $\mathcal{F}$, such that $\forall \pi \in \mathcal{P}_\kappa$, $\forall n \ge 1$,

$$E[A^{(\phi)}(\tilde f_n) - A^{(\phi)*}] \le \min_{f \in \mathcal{F}} (A^{(\phi)}(f) - A^{(\phi)*}) + C_1 \gamma(n, M, \kappa, \mathcal{F}, \pi); \tag{6}$$

(ii) $\exists \mathcal{F} = \{f_1, \ldots, f_M\}$ such that for any statistic $\bar f_n$, $\exists \pi \in \mathcal{P}_\kappa$, $\forall n \ge 1$,

$$E[A^{(\phi)}(\bar f_n) - A^{(\phi)*}] \ge \min_{f \in \mathcal{F}} (A^{(\phi)}(f) - A^{(\phi)*}) + C_2 \gamma(n, M, \kappa, \mathcal{F}, \pi). \tag{7}$$

Here, $C_1$ and $C_2$ are positive constants which may depend on $\kappa$. Moreover, when these two inequalities are satisfied, we say that the procedure $\tilde f_n$, appearing in (6), is an optimal MS-aggregate for the $\phi$-risk. If $\mathcal{C}$ denotes the convex hull of $\mathcal{F}$ and if (6) and (7) are satisfied with $\min_{f \in \mathcal{F}} (A^{(\phi)}(f) - A^{(\phi)*})$ replaced by $\min_{f \in \mathcal{C}} (A^{(\phi)}(f) - A^{(\phi)*})$, then we say that $\gamma(n, M, \kappa, \mathcal{F}, \pi)$ is an optimal rate of convex aggregation type for the $\phi$-risk and $\tilde f_n$ is an optimal convex aggregation procedure for the $\phi$-risk.

In Tsybakov [33], the optimal rate of aggregation depends only on $M$ and $n$. In our case, the residual term may be a function of the underlying probability measure $\pi$, of the class $\mathcal{F}$ and of the margin parameter $\kappa$. Note that, without any margin assumption, we obtain $\sqrt{(\log M)/n}$ for the residual, which is free from $\pi$ and $\mathcal{F}$. Under the margin assumption, we obtain a residual term dependent on $\pi$ and $\mathcal{F}$ and it should be interpreted as a normalizing factor in the ratio

$$\frac{E[A^{(\phi)}(\tilde f_n) - A^{(\phi)*}] - \min_{f \in \mathcal{F}} (A^{(\phi)}(f) - A^{(\phi)*})}{\gamma(n, M, \kappa, \mathcal{F}, \pi)}.$$

In that case, our definition does not imply the uniqueness of the residual.

Remark 1. Observe that a linear function achieves its maximum over a convex polygon at one of the vertices of the polygon. The hinge loss is linear on $[-1, 1]$ and $\mathcal{C}$ is a convex set, thus MS-aggregation or convex aggregation of functions with values in $[-1, 1]$ are identical problems when we use the hinge loss. That is, we have

$$\min_{f \in \mathcal{F}} A(f) = \min_{f \in \mathcal{C}} A(f). \tag{8}$$
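The argument behind (8) can be illustrated numerically: for functions with values in [-1, 1], the hinge risk of a convex combination equals the same convex combination of the individual hinge risks, so it cannot fall below the smallest of them. The sketch below works on a finite input space; all helper names are assumptions made for the illustration.

```python
import random

def hinge_risk(px, eta, f):
    """Hinge risk A(f) on a finite input space:
    px[j] = P(X = x_j), eta[j] = P(Y = 1 | X = x_j), f[j] = f(x_j)."""
    return sum(p * (e * max(1 - v, 0.0) + (1 - e) * max(1 + v, 0.0))
               for p, e, v in zip(px, eta, f))

def convex_combination(weights, functions):
    """Pointwise convex combination of functions given by their value lists."""
    size = len(functions[0])
    return [sum(w * g[j] for w, g in zip(weights, functions)) for j in range(size)]

def random_simplex_point(rng, m):
    """Random nonnegative weights summing to 1."""
    raw = [rng.random() for _ in range(m)]
    total = sum(raw)
    return [r / total for r in raw]

if __name__ == "__main__":
    rng = random.Random(0)
    size, m = 5, 4
    px = random_simplex_point(rng, size)
    eta = [rng.random() for _ in range(size)]
    functions = [[rng.uniform(-1.0, 1.0) for _ in range(size)] for _ in range(m)]
    lam = random_simplex_point(rng, m)
    mixed = hinge_risk(px, eta, convex_combination(lam, functions))
    vertex_risks = [hinge_risk(px, eta, g) for g in functions]
    print(mixed, sum(l * r for l, r in zip(lam, vertex_risks)), min(vertex_risks))
```

The first two printed numbers should agree up to rounding, and both are at least as large as the third.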

3. Optimal rates of convex aggregation for the hinge risk



Take $M$ functions $f_1, \ldots, f_M$ with values in $[-1, 1]$. Consider the convex hull $\mathcal{C} = \mathrm{Conv}(f_1, \ldots, f_M)$. We want to mimic the best function in $\mathcal{C}$ using the hinge risk and working under the margin assumption. We first introduce a margin assumption w.r.t. the hinge loss.

(MAH) Margin (or low noise) assumption for hinge risk. The probability distribution $\pi$ on the space $\mathcal{X} \times \{-1, 1\}$ satisfies the margin assumption for hinge risk MAH($\kappa$) with parameter $1 \le \kappa < +\infty$ if there exists $c > 0$ such that

$$E[|f(X) - f^*(X)|] \le c (A(f) - A^*)^{1/\kappa} \tag{9}$$

for any function $f$ on $\mathcal{X}$ with values in $[-1, 1]$.

Proposition 1. The assumption MAH($\kappa$) is equivalent to the margin assumption MA($\kappa$).

In what follows, we will assume that MA($\kappa$) holds and thus also that MAH($\kappa$) holds.

The AEW aggregate of $M$ functions $f_1, \ldots, f_M$ with values in $[-1, 1]$, introduced in (4) for a general loss, has a simple form for the case of the hinge loss, given by

$$\tilde f_n = \sum_{j=1}^{M} w^{(n)}(f_j) f_j, \quad \text{where } w^{(n)}(f_j) = \frac{\exp(\sum_{i=1}^{n} Y_i f_j(X_i))}{\sum_{k=1}^{M} \exp(\sum_{i=1}^{n} Y_i f_k(X_i))}, \quad \forall j = 1, \ldots, M. \tag{10}$$

In Theorems 1 and 2, we state the optimality of our aggregates in the sense of Definition 1.
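The simplified form of the hinge-loss weights comes from the fact that, for functions with values in [-1, 1], the empirical hinge risk is 1 minus the average of the products Y_i f(X_i), so the exponential weights (4) reduce to a softmax of the sums of those products. The sketch below compares the two computations; the function names and the toy data are assumptions made for the illustration.

```python
import math

def aew_weights_general(preds, y):
    """Weights (4) computed from the empirical hinge risks."""
    n = len(y)
    risks = [sum(max(1.0 - yi * p, 0.0) for yi, p in zip(y, row)) / n
             for row in preds]
    low = min(risks)  # shift for numerical stability only
    unnorm = [math.exp(-n * (r - low)) for r in risks]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def aew_weights_hinge(preds, y):
    """Simplified form: softmax of sum_i Y_i f_j(X_i), valid for values in [-1, 1]."""
    scores = [sum(yi * p for yi, p in zip(y, row)) for row in preds]
    top = max(scores)  # shift for numerical stability only
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

if __name__ == "__main__":
    y = [1, -1, 1]
    preds = [[0.5, -0.2, 1.0], [-1.0, 0.3, 0.0], [0.9, 0.9, -0.4]]
    print(aew_weights_general(preds, y))
    print(aew_weights_hinge(preds, y))
```

Both calls should print the same weights up to floating-point rounding.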

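Theorem 1 (Oracle inequality). Let $\kappa \ge 1$. We assume that $\pi$ satisfies MA($\kappa$). We denote by $\mathcal{C}$ the convex hull of a finite set $\mathcal{F}$ of functions $f_1, \ldots, f_M$ with values in $[-1, 1]$. Let $\tilde f_n$ be any of the four aggregates introduced in Section 2.2. Then, for any integers $M \ge 3$, $n \ge 1$, $\tilde f_n$ satisfies the inequality

$$E[A(\tilde f_n) - A^*] \le \min_{f \in \mathcal{C}} (A(f) - A^*) + C \Bigg( \sqrt{\frac{(\min_{f \in \mathcal{C}} (A(f) - A^*))^{1/\kappa} \log M}{n}} + \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)} \Bigg),$$

where $C = 32(6 \vee 537c \vee 16(2c + 1/3))$ for the ERM, AERM and AEW aggregates with $\kappa \ge 1$, $c > 0$ is the constant in (9) and $C = 32(6 \vee 537c \vee 16(2c + 1/3))(2 \vee (2\kappa - 1)/(\kappa - 1))$ for the CAEW aggregate with $\kappa > 1$. For $\kappa = 1$, the CAEW aggregate satisfies

$$E[A(\tilde f_n^{(CAEW)}) - A^*] \le \min_{f \in \mathcal{C}} (A(f) - A^*) + 2C \Bigg( \sqrt{\frac{\min_{f \in \mathcal{C}} (A(f) - A^*) \log M}{n}} + \frac{(\log M) \log n}{n} \Bigg).$$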
Theorem 2 (Lower bound). Let $\kappa \ge 1$ and let $M, n$ be two integers such that $2 \log_2 M \le n$. We assume that the input space $\mathcal{X}$ is infinite. There exists a constant $C > 0$, depending only on $\kappa$ and $c$, and a set of prediction rules $\mathcal{F} = \{f_1, \ldots, f_M\}$ such that for any real-valued procedure $\bar f_n$, there exists a probability measure $\pi$ satisfying MA($\kappa$), for which

$$E[A(\bar f_n) - A^*] \ge \min_{f \in \mathcal{C}} (A(f) - A^*) + C \Bigg( \sqrt{\frac{(\min_{f \in \mathcal{C}} A(f) - A^*)^{1/\kappa} \log M}{n}} + \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)} \Bigg),$$

where $C = c^{\kappa} (4e)^{-1} 2^{-2\kappa(\kappa - 1)/(2\kappa - 1)} (\log 2)^{-\kappa/(2\kappa - 1)}$ and $c > 0$ is the constant in (9).

Combining the exact oracle inequality of Theorem 1 and the lower bound of Theorem 2, we see that the residual

$$\sqrt{\frac{(\min_{f \in \mathcal{C}} A(f) - A^*)^{1/\kappa} \log M}{n}} + \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)} \tag{11}$$

is an optimal rate of convex aggregation of $M$ functions with values in $[-1, 1]$ for the hinge loss. Moreover, for any real-valued function $f$, we have $\max(1 - y\psi(f(x)), 0) \le \max(1 - yf(x), 0)$ for all $y \in \{-1, 1\}$ and $x \in \mathcal{X}$, thus

$$A(\psi(f)) - A^* \le A(f) - A^*, \quad \text{where } \psi(x) = \max(-1, \min(x, 1)), \ \forall x \in \mathbb{R}. \tag{12}$$

Thus, by aggregating $\psi(f_1), \ldots, \psi(f_M)$, it is easy to check that

$$\sqrt{\frac{(\min_{f \in \mathcal{F}} A(\psi(f)) - A^*)^{1/\kappa} \log M}{n}} + \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)}$$

is an optimal rate of model-selection aggregation of $M$ real-valued functions $f_1, \ldots, f_M$ w.r.t. the hinge loss. In both cases, the aggregate with exponential weights, as well as ERM and AERM, attains these optimal rates and the CAEW aggregate attains the optimal rate if $\kappa > 1$. Applications and learning properties of the AEW procedure can be found in Lecué [20, 21] (in particular, adaptive SVM classifiers are constructed by aggregating only $(\log n)^2$ SVM estimators). In Theorem 1, the AEW procedure satisfies an exact oracle inequality with an optimal residual term, whereas in Lecué [21] and Lecué [20] the oracle inequalities satisfied by the AEW procedure are not exact (there is a multiplying factor greater than 1 in front of the bias term) and in Lecué [21], the residual is not optimal. In Lecué [20], it is proved that for any finite set $\mathcal{F}$ of functions $f_1, \ldots, f_M$ with values in $[-1, 1]$ and any $\epsilon > 0$, there exists a constant $C(\epsilon) > 0$ such that, for $\mathcal{C}$ the convex hull of $\mathcal{F}$,

$$E[A(\tilde f_n^{(AEW)}) - A^*] \le (1 + \epsilon) \min_{f \in \mathcal{C}} (A(f) - A^*) + C(\epsilon) \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)}. \tag{13}$$
This oracle inequality is good enough for several applications (see the examples in Lecué [20]). Nevertheless, (13) can be easily deduced from Theorem 1 using Lemma 3 and may be inefficient for constructing adaptive estimators with exact constants (because of the factor greater than 1 in front of $\min_{f \in \mathcal{C}} (A(f) - A^*)$). Moreover, oracle inequalities with a factor greater than 1 in front of the oracle $\min_{f \in \mathcal{C}} (A(f) - A^*)$ do not characterize the real behavior of the technique of aggregation which we are using. For instance, for any strictly convex loss $\phi$, the ERM procedure satisfies (cf. Chesneau and Lecué [12])

$$E[A^{(\phi)}(\tilde f_n^{(ERM)}) - A^{(\phi)*}] \le (1 + \epsilon) \min_{f \in \mathcal{F}} (A^{(\phi)}(f) - A^{(\phi)*}) + C(\epsilon) \frac{\log M}{n}. \tag{14}$$

But, it has been recently proven, in Lecué [22], that the ERM procedure cannot mimic the oracle faster than $\sqrt{(\log M)/n}$, whereas, for strictly convex losses, the CAEW procedure can mimic the oracle at the rate $(\log M)/n$ (cf. Juditsky et al. [19]). Thus, for strictly convex losses, it is better to use the aggregation procedure with exponential weights than ERM (or even penalized ERM procedures (cf. Lecué [22])) to mimic the oracle. Non-exact oracle inequalities of the form (14) cannot tell us which procedure is better to use since both ERM and CAEW procedures satisfy this inequality.

It is interesting to note that the rate of aggregation (11) depends on both the class $\mathcal{F}$ and $\pi$ through the term $\min_{f \in \mathcal{C}} A(f) - A^*$. This is different from the regression problem (cf. Tsybakov [33]), where the optimal aggregation rates depend only on $M$ and $n$. Three cases can be considered, where $\mathcal{M}(\mathcal{F}, \pi)$ denotes $\min_{f \in \mathcal{C}} (A(f) - A^*)$ and $M$ may depend on $n$ (i.e., for function classes $\mathcal{F}$ depending on $n$):

1. If $\mathcal{M}(\mathcal{F}, \pi) \le a ((\log M)/n)^{\kappa/(2\kappa - 1)}$, for an absolute constant $a > 0$, then the hinge risk of our aggregates attains $\min_{f \in \mathcal{C}} A(f) - A^*$ with the rate $((\log M)/n)^{\kappa/(2\kappa - 1)}$, which can be $(\log M)/n$ in the case $\kappa = 1$;

2. If $a ((\log M)/n)^{\kappa/(2\kappa - 1)} \le \mathcal{M}(\mathcal{F}, \pi) \le b$ for some constants $a, b > 0$, then our aggregates mimic the best prediction rule in $\mathcal{C}$ with a rate slower than $((\log M)/n)^{\kappa/(2\kappa - 1)}$, but faster than $((\log M)/n)^{1/2}$;

3. If $\mathcal{M}(\mathcal{F}, \pi) \ge a > 0$, where $a > 0$ is a constant, then the rate of aggregation is $\sqrt{(\log M)/n}$, as in the case of no margin assumption.

We can explain this behavior by the fact that not only $\kappa$, but also $\min_{f \in \mathcal{C}} A(f) - A^*$, measures the difficulty of classification. For instance, in the extreme case where $\min_{f \in \mathcal{C}} A(f) - A^* = 0$, which means that $\mathcal{C}$ contains the Bayes rule, we have the fastest rate $((\log M)/n)^{\kappa/(2\kappa - 1)}$. In the worst cases, which are realized when $\kappa$ tends to $\infty$ or $\min_{f \in \mathcal{C}} (A(f) - A^*) \ge a > 0$, where $a > 0$ is an absolute constant, the optimal rate of aggregation is the slow rate $\sqrt{(\log M)/n}$.
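The three regimes can be explored numerically by evaluating the residual (11), without its multiplicative constant, as a function of the oracle excess hinge risk. The function name `residual` and its argument names are assumptions made for this sketch.

```python
import math

def residual(n, M, kappa, oracle_excess):
    """Residual term (11), constants omitted:
    sqrt(oracle_excess^(1/kappa) * log(M)/n) + (log(M)/n)^(kappa/(2*kappa - 1)),
    where oracle_excess stands for min over the convex hull of A(f) - A*."""
    r = math.log(M) / n
    return (math.sqrt(oracle_excess ** (1.0 / kappa) * r)
            + r ** (kappa / (2.0 * kappa - 1.0)))

if __name__ == "__main__":
    n, M = 10_000, 50
    for kappa in (1.0, 2.0, 10.0):
        for excess in (0.0, 1e-3, 0.5):
            print(kappa, excess, residual(n, M, kappa, excess))
```

With a zero oracle excess only the fast term remains, and for kappa = 1 it equals (log M)/n; with an oracle excess bounded away from zero, the square-root term of order sqrt((log M)/n) dominates.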



4. Optimal rates of MS-aggregation for the excess risk 

We now provide oracle inequalities and lower bounds for the excess Bayes risk. First, we can deduce, from Theorems 1 and 2, 'almost optimal rates of aggregation' for the excess Bayes risk achieved by the AEW aggregate. Second, using the ERM aggregate, we obtain optimal rates of model selection aggregation for the excess Bayes risk. 

Using inequality (3), we can derive, from Theorem 1, an oracle inequality for the excess 
Bayes risk. The lower bound is obtained using the same proof as in Theorem 2. 

Corollary 1. Let $\mathcal{F} = \{f_1, \ldots, f_M\}$ be a finite set of prediction rules for an integer $M \ge 3$ and $\kappa \ge 1$. We assume that $\pi$ satisfies MA($\kappa$). Denote by $\tilde f_n$ either the ERM, the AERM or the AEW aggregate. For any number $a > 0$ and any integer $n$, $\tilde f_n$ then satisfies

$$E[R(\tilde f_n) - R^*] \le 2(1 + a) \min_{j = 1, \ldots, M} (R(f_j) - R^*) + \big[ C + (C^{2\kappa}/a)^{1/(2\kappa - 1)} \big] \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)}, \tag{15}$$

where $C = 32(6 \vee 537c \vee 16(2c + 1/3))$. The CAEW aggregate satisfies the same inequality with $C = 32(6 \vee 537c \vee 16(2c + 1/3))(2 \vee (2\kappa - 1)/(\kappa - 1))$ when $\kappa > 1$. For $\kappa = 1$, the CAEW aggregate satisfies (15), where we need to multiply the residual by $\log n$.

Moreover, there exists a finite set of prediction rules $\mathcal{F} = \{f_1, \ldots, f_M\}$ such that, for any classifier $\bar f_n$, there exists a probability measure $\pi$ on $\mathcal{X} \times \{-1, 1\}$ satisfying MA($\kappa$), such that, for any $n \ge 1$, $a > 0$,

$$E[R(\bar f_n) - R^*] \ge 2(1 + a) \min_{f \in \mathcal{F}} (R(f) - R^*) + C(a) \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)},$$

where $C(a) > 0$ is a constant depending only on $a$.
Due to Corollary 1,

$$\Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)}$$

is an almost optimal rate of MS-aggregation for the excess risk and the AEW aggregate achieves this rate. The word "almost" is used here because $\min_{f \in \mathcal{F}} (R(f) - R^*)$ is multiplied by a constant greater than 1. Oracle inequality (15) is not exact since the minimal excess risk over $\mathcal{F}$ is multiplied by the constant $2(1 + a) > 1$. This is not the case when using the ERM aggregate, as explained in the following theorem.

Theorem 3. Let $\kappa \ge 1$. We assume that $\pi$ satisfies MA($\kappa$). We denote by $\mathcal{F} = \{f_1, \ldots, f_M\}$ a set of prediction rules. The ERM aggregate over $\mathcal{F}$ satisfies, for any integer $n \ge 1$,

$$E[R(\tilde f_n^{(ERM)}) - R^*] \le \min_{f \in \mathcal{F}} (R(f) - R^*) + C \Bigg( \sqrt{\frac{(\min_{f \in \mathcal{F}} (R(f) - R^*))^{1/\kappa} \log M}{n}} + \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)} \Bigg),$$

where $C = 32(6 \vee 537c_0 \vee 16(2c_0 + 1/3))$ and $c_0$ is the constant appearing in MA($\kappa$).

Using Lemma 3, we can deduce the results of Herbei and Wegkamp [17] from Theorem 3. Oracle inequalities under MA($\kappa$) have already been stated in Massart [27] (cf. Boucheron et al. [7]), but the remainder term obtained is worse than the one obtained in Theorem 3. 

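According to Definition 1, combining Theorem 3 and the following theorem, the rate

$$\sqrt{\frac{(\min_{f \in \mathcal{F}} (R(f) - R^*))^{1/\kappa} \log M}{n}} + \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)}$$

is an optimal rate of MS-aggregation w.r.t. the excess Bayes risk. The ERM aggregate achieves this rate.

Theorem 4 (Lower bound). Let $M \ge 3$ and $n$ be two integers such that $2 \log_2 M \le n$ and $\kappa \ge 1$. Assume that $\mathcal{X}$ is infinite. There exists a constant $C > 0$ and a set of prediction rules $\mathcal{F} = \{f_1, \ldots, f_M\}$ such that for any procedure $\bar f_n$ with values in $\mathbb{R}$, there exists a probability measure $\pi$ satisfying MA($\kappa$), for which

$$E[R(\bar f_n) - R^*] \ge \min_{f \in \mathcal{F}} (R(f) - R^*) + C \Bigg( \sqrt{\frac{(\min_{f \in \mathcal{F}} (R(f) - R^*))^{1/\kappa} \log M}{n}} + \Big( \frac{\log M}{n} \Big)^{\kappa/(2\kappa - 1)} \Bigg),$$

where $C = c_0^{\kappa} (4e)^{-1} 2^{-2\kappa(\kappa - 1)/(2\kappa - 1)} (\log 2)^{-\kappa/(2\kappa - 1)}$ and $c_0$ is the constant appearing in MA($\kappa$).

5. Proofs

Proof of Proposition 1. Since, for any function $f$ from $\mathcal{X}$ to $\{-1, 1\}$, we have $2(R(f) - R^*) = A(f) - A^*$, it follows that MA($\kappa$) is implied by MAH($\kappa$).

Assume that MA($\kappa$) holds. We first explore the case $\kappa > 1$, where MA($\kappa$) implies that there exists a constant $c_1 > 0$ such that $P(|2\eta(X) - 1| \le t) \le c_1 t^{1/(\kappa - 1)}$ for any $t > 0$ (cf. Boucheron et al. [7]). Let $f$ be a function from $\mathcal{X}$ to $[-1, 1]$. We have, for any $t > 0$,

$$A(f) - A^* = E[|2\eta(X) - 1| \, |f(X) - f^*(X)|]$$
$$\ge t\, E[|f(X) - f^*(X)| \, 1_{|2\eta(X) - 1| \ge t}]$$
$$\ge t \big( E[|f(X) - f^*(X)|] - 2 P(|2\eta(X) - 1| \le t) \big)$$
$$\ge t \big( E[|f(X) - f^*(X)|] - 2 c_1 t^{1/(\kappa - 1)} \big).$$

For $t_0 = ((\kappa - 1)/(2 c_1 \kappa))^{\kappa - 1} E[|f(X) - f^*(X)|]^{\kappa - 1}$, we obtain

$$A(f) - A^* \ge ((\kappa - 1)/(2 c_1 \kappa))^{\kappa - 1} \kappa^{-1} E[|f(X) - f^*(X)|]^{\kappa}.$$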
For the case $\kappa = 1$, MA(1) implies that there exists $h > 0$ such that $|2\eta(X) - 1| \ge h$ a.s. Indeed, if for any $N \in \mathbb{N}^*$ (the set of all positive integers), there exists $A_N \in \mathcal{A}$ (the $\sigma$-algebra on $\mathcal{X}$) such that $P^X(A_N) > 0$ and $|2\eta(x) - 1| \le N^{-1}$, $\forall x \in A_N$, then, for

$$f_N(x) = \begin{cases} -f^*(x), & \text{if } x \in A_N, \\ f^*(x), & \text{otherwise}, \end{cases}$$

we obtain $R(f_N) - R^* \le 2 P^X(A_N)/N$ and $E[|f_N(X) - f^*(X)|] = 2 P^X(A_N)$, and there is no constant $c_0 > 0$ such that $P^X(A_N) \le c_0 P^X(A_N)/N$ for all $N \in \mathbb{N}^*$. So, assumption MA(1) does not hold if no $h > 0$ satisfies $|2\eta(X) - 1| \ge h$ a.s. Thus, for any $f$ from $\mathcal{X}$ to $[-1, 1]$, we have $A(f) - A^* = E[|2\eta(X) - 1| \, |f(X) - f^*(X)|] \ge h E[|f(X) - f^*(X)|]$. □

Proof of Theorem 1. We start with a general result which says that if $\phi$ is a convex loss, then the aggregation procedures with the weights $w^{(n)}(f)$, $f \in \mathcal{F}$, introduced in (4) satisfy

$$A_n^{(\phi)}(\tilde f_n^{(AEW)}) \le A_n^{(\phi)}(\tilde f_n^{(ERM)}) + \frac{\log M}{n} \quad \text{and} \quad A_n^{(\phi)}(\tilde f_n^{(AERM)}) \le A_n^{(\phi)}(\tilde f_n^{(ERM)}). \tag{16}$$

Indeed, take $\phi$ to be a convex loss. We have $\phi(Y \tilde f_n(X)) \le \sum_{f \in \mathcal{F}} w^{(n)}(f) \phi(Y f(X))$, thus

$$A_n^{(\phi)}(\tilde f_n) \le \sum_{f \in \mathcal{F}} w^{(n)}(f) A_n^{(\phi)}(f).$$

Any $f \in \mathcal{F}$ satisfies

$$A_n^{(\phi)}(f) = A_n^{(\phi)}(\tilde f_n^{(ERM)}) + n^{-1} \big( \log(w^{(n)}(\tilde f_n^{(ERM)})) - \log(w^{(n)}(f)) \big),$$

thus, by averaging this equality over the $w^{(n)}(f)$ and using $\sum_{f \in \mathcal{F}} w^{(n)}(f) \log(w^{(n)}(f)/M^{-1}) = K(w|u) \ge 0$, where $K(w|u)$ denotes the Kullback-Leibler divergence between the weights $w = (w^{(n)}(f))_{f \in \mathcal{F}}$ and the uniform weights $u = (1/M)_{f \in \mathcal{F}}$, we obtain the first inequality of (16). Using the convexity of $\phi$, we obtain a similar result for the AERM aggregate.
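Inequality (16) lends itself to a quick numerical sanity check. The sketch below uses the hinge loss as the convex loss and random real-valued candidate functions; the function name `check_inequality_16`, the tolerance used to detect ties among minimizers, and the data-generating choices are assumptions made for the illustration.

```python
import math
import random

def hinge(z):
    return max(1.0 - z, 0.0)

def emp_risk(values, y):
    """Empirical hinge risk of a function given by its values at X_1..X_n."""
    return sum(hinge(yi * v) for yi, v in zip(y, values)) / len(y)

def check_inequality_16(preds, y):
    """Return (risk of AEW, risk of AERM, risk of ERM, log(M)/n), all empirical."""
    n, m = len(y), len(preds)
    risks = [emp_risk(row, y) for row in preds]
    erm_risk = min(risks)
    # Exponential weights (4); the shift by erm_risk is for stability only.
    unnorm = [math.exp(-n * (r - erm_risk)) for r in risks]
    total = sum(unnorm)
    w = [u / total for u in unnorm]
    aew = [sum(w[j] * preds[j][i] for j in range(m)) for i in range(n)]
    # Averaged ERM: uniform average over the empirical risk minimizers.
    best = [j for j, r in enumerate(risks) if r <= erm_risk + 1e-12]
    aerm = [sum(preds[j][i] for j in best) / len(best) for i in range(n)]
    return emp_risk(aew, y), emp_risk(aerm, y), erm_risk, math.log(m) / n

if __name__ == "__main__":
    rng = random.Random(0)
    n, m = 8, 5
    y = [rng.choice([-1, 1]) for _ in range(n)]
    preds = [[rng.uniform(-2.0, 2.0) for _ in range(n)] for _ in range(m)]
    print(check_inequality_16(preds, y))
```

In each run the first value should be at most the third plus the fourth, and the second at most the third.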

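Let $\tilde f_n$ be either the ERM, the AERM or the AEW aggregate for the class $\mathcal{F} = \{f_1, \ldots, f_M\}$. In all cases, we have, according to (16),

$$A_n(\tilde f_n) \le \min_{i = 1, \ldots, M} A_n(f_i) + \frac{\log M}{n}. \tag{17}$$

Let $\epsilon > 0$. We consider $\mathcal{D} = \{f \in \mathcal{C} : A(f) > A_{\mathcal{C}} + 2\epsilon\}$, where $A_{\mathcal{C}} \stackrel{\mathrm{def}}{=} \min_{f \in \mathcal{C}} A(f)$. Let $x > 0$. If

$$\sup_{f \in \mathcal{D}} \frac{A(f) - A^* - (A_n(f) - A_n(f^*))}{A(f) - A^* + x} \le \frac{\epsilon}{A_{\mathcal{C}} - A^* + 2\epsilon + x},$$

then, for any $f \in \mathcal{D}$, we have

$$A_n(f) - A_n(f^*) \ge A(f) - A^* - \frac{\epsilon (A(f) - A^* + x)}{A_{\mathcal{C}} - A^* + 2\epsilon + x} \ge A_{\mathcal{C}} - A^* + \epsilon,$$

because $A(f) - A^* \ge A_{\mathcal{C}} - A^* + 2\epsilon$. Hence,

$$P\Big[ \inf_{f \in \mathcal{D}} (A_n(f) - A_n(f^*)) < A_{\mathcal{C}} - A^* + \epsilon \Big] \le P\Big[ \sup_{f \in \mathcal{D}} \frac{A(f) - A^* - (A_n(f) - A_n(f^*))}{A(f) - A^* + x} > \frac{\epsilon}{A_{\mathcal{C}} - A^* + 2\epsilon + x} \Big]. \tag{18}$$

According to (8), for $f' \in \{f_1, \ldots, f_M\}$ such that $A(f') = \min_{j = 1, \ldots, M} A(f_j)$, we have $A_{\mathcal{C}} = \inf_{f \in \mathcal{C}} A(f) = \inf_{f \in \{f_1, \ldots, f_M\}} A(f) = A(f')$. According to (17), we have

$$A_n(\tilde f_n) \le \min_{j = 1, \ldots, M} A_n(f_j) + \frac{\log M}{n} \le A_n(f') + \frac{\log M}{n}.$$

Thus, if we assume that $A(\tilde f_n) > A_{\mathcal{C}} + 2\epsilon$, then, by definition, we have $\tilde f_n \in \mathcal{D}$ and thus there exists $f \in \mathcal{D}$ such that $A_n(f) - A_n(f^*) \le A_n(f') - A_n(f^*) + (\log M)/n$. According to (18), we have

$$P[A(\tilde f_n) > A_{\mathcal{C}} + 2\epsilon] \le P\Big[ \inf_{f \in \mathcal{D}} A_n(f) - A_n(f^*) \le A_n(f') - A_n(f^*) + \frac{\log M}{n} \Big]$$
$$\le P\Big[ \inf_{f \in \mathcal{D}} A_n(f) - A_n(f^*) \le A_{\mathcal{C}} - A^* + \epsilon \Big] + P\Big[ A_n(f') - A_n(f^*) \ge A_{\mathcal{C}} - A^* + \epsilon - \frac{\log M}{n} \Big]$$
$$\le P\Big[ \sup_{f \in \mathcal{C}} \frac{A(f) - A^* - (A_n(f) - A_n(f^*))}{A(f) - A^* + x} > \frac{\epsilon}{A_{\mathcal{C}} - A^* + 2\epsilon + x} \Big] + P\Big[ A_n(f') - A_n(f^*) \ge A_{\mathcal{C}} - A^* + \epsilon - \frac{\log M}{n} \Big].$$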

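If we assume that

$$\sup_{f \in \mathcal{C}} \frac{A(f) - A^* - (A_n(f) - A_n(f^*))}{A(f) - A^* + x} > \frac{\epsilon}{A_{\mathcal{C}} - A^* + 2\epsilon + x},$$

then there exists $f = \sum_{j=1}^{M} w_j f_j \in \mathcal{C}$ (where $w_j \ge 0$ and $\sum_j w_j = 1$) such that

$$\frac{A(f) - A^* - (A_n(f) - A_n(f^*))}{A(f) - A^* + x} > \frac{\epsilon}{A_{\mathcal{C}} - A^* + 2\epsilon + x}.$$

The linearity of the hinge loss on $[-1, 1]$ leads to

$$\frac{A(f) - A^* - (A_n(f) - A_n(f^*))}{A(f) - A^* + x} = \frac{\sum_{j=1}^{M} w_j [A(f_j) - A^* - (A_n(f_j) - A_n(f^*))]}{\sum_{j=1}^{M} w_j [A(f_j) - A^* + x]}$$

and, according to Lemma 2, we have

$$\max_{j = 1, \ldots, M} \frac{A(f_j) - A^* - (A_n(f_j) - A_n(f^*))}{A(f_j) - A^* + x} > \frac{\epsilon}{A_{\mathcal{C}} - A^* + 2\epsilon + x}.$$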

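We now use the relative concentration inequality of Lemma 5 to obtain

$$P\Big[ \max_{j = 1, \ldots, M} \frac{A(f_j) - A^* - (A_n(f_j) - A_n(f^*))}{A(f_j) - A^* + x} > \frac{\epsilon}{A_{\mathcal{C}} - A^* + 2\epsilon + x} \Big]$$
$$\le M \Big( 1 + \frac{8c (A_{\mathcal{C}} - A^* + 2\epsilon + x)^2 x^{1/\kappa}}{n (\epsilon x)^2} \Big) \exp\Big( -\frac{n (\epsilon x)^2}{8c (A_{\mathcal{C}} - A^* + 2\epsilon + x)^2 x^{1/\kappa}} \Big)$$
$$+ M \Big( 1 + \frac{16 (A_{\mathcal{C}} - A^* + 2\epsilon + x)}{3 n \epsilon x} \Big) \exp\Big( -\frac{3 n \epsilon x}{16 (A_{\mathcal{C}} - A^* + 2\epsilon + x)} \Big).$$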

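Using Proposition 1 and Lemma 4 to upper bound the variance term and applying Bernstein's inequality, we get

$$P\Big[ A_n(f') - A_n(f^*) \ge A_{\mathcal{C}} - A^* + \epsilon - \frac{\log M}{n} \Big] \le \exp\Big( -\frac{n (\epsilon - (\log M)/n)^2}{4c (A_{\mathcal{C}} - A^*)^{1/\kappa} + (8/3)(\epsilon - (\log M)/n)} \Big)$$

for any $\epsilon > (\log M)/n$. We take $x = A_{\mathcal{C}} - A^* + 2\epsilon$, then, for any $(\log M)/n < \epsilon < 1$, we have

$$P(A(\tilde f_n) > A_{\mathcal{C}} + 2\epsilon) \le \exp\Big( -\frac{n (\epsilon - (\log M)/n)^2}{4c (A_{\mathcal{C}} - A^*)^{1/\kappa} + (8/3)(\epsilon - (\log M)/n)} \Big)$$
$$+ M \Big( 1 + \frac{32c (A_{\mathcal{C}} - A^* + 2\epsilon)^{1/\kappa}}{n \epsilon^2} \Big) \exp\Big( -\frac{n \epsilon^2}{32c (A_{\mathcal{C}} - A^* + 2\epsilon)^{1/\kappa}} \Big)$$
$$+ M \Big( 1 + \frac{32}{3 n \epsilon} \Big) \exp\Big( -\frac{3 n \epsilon}{32} \Big).$$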



Thus, for 2(logM)/n <u < 1, we have 



EL4(/ n )-A c ]<2 U + 2 / [T 1 (e)+M(T 2 ( e ) + r 3 (e))]de, 



(19) 
where 

n(e- (log M) /n) 5 



T 1 ( e ) = cxp - 



4c((A c - A*)/2)V« + (8/3)(e - (logM)/n) 



64c(A c -A* + 26) 1 /^ / 2^ne 2 
T 2(e) = 1+ ttt— 5 cxp 



2 1 /«ne 2 y *\ 64c(A c -A* +26) 1 /- 



and 



/ s ( 16 \ / 3ne 
T ^=( 1+ 3^J eXP (-l6- 

Set $\beta_1 = \min(32^{-1}, (2148c)^{-1}, (64(2c + 1/3))^{-1})$, where the constant $c > 0$ appears in MAH($\kappa$). Consider separately the following cases, (C1) and (C2).

(C1) The case $A_{\mathcal{C}} - A^* \ge (\log M/(\beta_1 n))^{\kappa/(2\kappa - 1)}$. Denote by $\mu(M)$ the solution of $\mu = 3M \exp(-\mu)$. We have $(\log M)/2 \le \mu(M) \le \log M$. Take $u$ such that $(n \beta_1 u^2)/(A_{\mathcal{C}} - A^*)^{1/\kappa} = \mu(M)$. Using the definitions of case (C1) and $\mu(M)$, we get $u \le A_{\mathcal{C}} - A^*$. Moreover, $u \ge 4(\log M)/n$, thus

/ T 1 (e)de< exp -- , V, 7 __- de 

A/2 "A/2 V (4c + 4/3)(A c -,4*)V«J 

cx / n(e/2) 2 \ 
(a c -a.)/2 CXP V (8c + 4/3)eV^ 

Using Lemma 1 and the inequality u < Ac — A* , we obtain 

1 TAc)dc< M(2c+1/3)(A C -A*) 1/K 
u/2 1 ~ nu 



x exp 



(20) 



64(2c+l/3)(Ac- A*) 1 /* 
We have 128c(Ae — A* + u) < nu 2 . Thus, using Lemma 1, we get 

1 f (A c -A')/2 , ne 2 x 

r 2 (e)de<2/ cxp — — -zr-. — 1 de 

/a K ~ J u/2 V Mc(Ac-A*y/-J 

+ 2 / expf- ^'^ ^de (21) 



(A c -A-)/2 



128c 



< 2148c(A c - A*f/ K ^ / nu 2 



2148c(A c - A*) 1 /* 



We have $u \ge 32(3n)^{-1}$, so



f 1 , x , 64 / 3nu 



/ 3nu 2 

< : exp 



3nu ^\ 64(A C - A*) 1 /* 

From (20), (21), (22) and (19), we obtain 

E[A(f n ) - A C ] <2u + 6M {Ac -f )1/K exp ( 



(22) 



nftu *\ (Ac-A*) 1 /" 



The definitions of u leads to E[A(/„) - A c ] < A^j 1 ( a c-a*)^\o S m _ 

(C2) The case $A_{\mathcal{C}} - A^* \le (\log M/(\beta_2 n))^{\kappa/(2\kappa - 1)}$. We now choose $u$ such that $n \beta_2 u^{(2\kappa - 1)/\kappa} = \mu(M)$, where $\beta_2 = \min(3(32(6c + 1))^{-1}, (256c)^{-1}, 3/64)$. Using the definition of case (C2) and $\mu(M)$, we get $u \ge A_{\mathcal{C}} - A^*$. Using Lemma 1 and $u > 4(\log M)/n$, $u \ge 2(32c/n)^{\kappa/(2\kappa - 1)}$ and $u > 32/(3n)$, respectively, we obtain



f 1 T ( \ A <r 32 ( 6c + 1 ) ( ^u 2 - 1 /- 

f 1 128c 
/ , r2(e)d£ - „„i-iA exp 



32(6c+ 1) 

nv 2-l/ K 



128c 



(23) 



and 



f 1 64 / 3nM 2 " 1 / K \ , x 

L^^^^^i-^-y (24) 



it/2 

From (23), (24) and (19), we obtain 

np2u ' 

The definition of u yields E[A(f n ) - Ac] < 4( 1 ^) k/(2k_1) . 



Finally, we obtain 
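For the CAEW aggregate, it suffices to upper bound the sums by integrals in the following inequality to get the result:

$$E[A(\tilde f_n^{(CAEW)}) - A^*] \le \frac{1}{n} \sum_{k=1}^{n} E[A(\tilde f_k^{(AEW)}) - A^*]$$
$$\le \min_{f \in \mathcal{C}} A(f) - A^* + C \Bigg\{ \sqrt{(A_{\mathcal{C}} - A^*)^{1/\kappa} \log M} \, \frac{1}{n} \sum_{k=1}^{n} \frac{1}{\sqrt{k}} + (\log M)^{\kappa/(2\kappa - 1)} \frac{1}{n} \sum_{k=1}^{n} \frac{1}{k^{\kappa/(2\kappa - 1)}} \Bigg\}.$$

□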

Proof of Theorem 2. Let $a$ be a positive number, $\mathcal{F}$ be a finite set of $M$ real-valued functions and $f_1, \ldots, f_M$ be $M$ prediction rules (which will be carefully chosen in what follows). Using (8), taking $\mathcal{F} = \{f_1, \ldots, f_M\}$ and assuming that $f^* \in \{f_1, \ldots, f_M\}$, we obtain

$$\inf_{\bar f_n} \sup_{\pi \in \mathcal{P}_\kappa} \Big( E[A(\bar f_n) - A^*] - (1 + a) \min_{f \in \mathrm{Conv}(\mathcal{F})} (A(f) - A^*) \Big) \ge \inf_{\bar f_n} \sup_{\pi \in \mathcal{P}_\kappa,\ f^* \in \{f_1, \ldots, f_M\}} E[A(\bar f_n) - A^*], \tag{25}$$

where $\mathrm{Conv}(\mathcal{F})$ is the set made of all convex combinations of elements in $\mathcal{F}$. Let $N$ be an integer such that $2^{N-1} \le M$, $x_1, \ldots, x_N$ be $N$ distinct points of $\mathcal{X}$ and $w$ be a positive number satisfying $(N - 1)w \le 1$. Denote by $P^X$ the probability measure on $\mathcal{X}$ such that $P^X(\{x_j\}) = w$, for $j = 1, \ldots, N - 1$, and $P^X(\{x_N\}) = 1 - (N - 1)w$. We consider the cube $\Omega = \{-1, 1\}^{N-1}$. Let $0 < h < 1$. For all $\sigma = (\sigma_1, \ldots, \sigma_{N-1}) \in \Omega$ we consider

$$\eta_\sigma(x) = \begin{cases} (1 + \sigma_j h)/2, & \text{if } x = x_j,\ j = 1, \ldots, N - 1, \\ 1, & \text{if } x = x_N. \end{cases}$$

For all $\sigma \in \Omega$, we denote by $\pi_\sigma$ the probability measure on $\mathcal{X} \times \{-1, 1\}$ having $P^X$ for marginal on $\mathcal{X}$ and $\eta_\sigma$ for conditional probability function.

Assume that $\kappa > 1$. We have $P(|2\eta_\sigma(X) - 1| \le t) = (N - 1) w 1_{h \le t}$ for any $0 \le t < 1$. Thus, if we assume that $(N - 1)w \le h^{1/(\kappa - 1)}$, then $P(|2\eta_\sigma(X) - 1| \le t) \le t^{1/(\kappa - 1)}$ for all $0 \le t < 1$. Thus, according to Tsybakov [34], $\pi_\sigma$ belongs to $\mathcal{P}_\kappa$.

We denote by p the Hamming distance on f2. Let er, cr' € f2 be such that p(cr,<j') = 1. 
Denote by P the Hellinger distance. Since P 2 (tt®", tt® ™) = 2(1 - (1 - P 2 (7r CT ,7iv)/2) n ) 
and 

JV-l 



3=1 



2iu(l - Vl^/i 2 ), 
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the Hellinger distance between the measures 7r® n and 7r®, n satisfies 



H 2 (TT® n ,ir® n ) = 2(1 - (1 - w(l - Vl-hi)) n ). 



Take w and ft such that w(l- Vl - ft 2 ) < rT 1 . Then, H 2 {irf n , 7if,™ ) < /3 = 2 ( 1 - e~ 1 ) < 
2 for any integer n. 
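The closed form for the single-observation Hellinger distance can be checked numerically. The sketch below is only an illustration: it assumes that the conditional probability on the points $x_1,\dots,x_{N-1}$ is $(1+\sigma_j h)/2$ (the value consistent with the closed form $2w(1-\sqrt{1-h^2})$) and that $x_N$ carries the same label distribution under every $\sigma$. The function names and the numerical values of `w`, `h` and `n` are hypothetical.

```python
import math

def eta(sigma_j, h):
    """Assumed conditional probability P(Y = 1 | X = x_j) = (1 + sigma_j * h) / 2."""
    return (1.0 + sigma_j * h) / 2.0

def hellinger_sq(sigma, sigma_prime, w, h):
    """Squared Hellinger distance between pi_sigma and pi_sigma' (one observation).

    Each of x_1, ..., x_{N-1} has mass w; x_N is assumed to have the same
    label distribution under both measures, so it contributes nothing.
    """
    total = 0.0
    for s, sp in zip(sigma, sigma_prime):
        p, q = eta(s, h), eta(sp, h)
        total += w * ((math.sqrt(p) - math.sqrt(q)) ** 2
                      + (math.sqrt(1.0 - p) - math.sqrt(1.0 - q)) ** 2)
    return total

def hellinger_sq_product(h2, n):
    """Squared Hellinger distance between the n-fold products, given the one-sample value."""
    return 2.0 * (1.0 - (1.0 - h2 / 2.0) ** n)

w, h, n = 0.01, 0.3, 50            # illustrative values with (N - 1) * w <= 1
sigma = (1, -1, 1, 1)
sigma_prime = (1, -1, -1, 1)       # Hamming distance 1 from sigma
h2 = hellinger_sq(sigma, sigma_prime, w, h)
closed_form = 2.0 * w * (1.0 - math.sqrt(1.0 - h * h))
h2_product = hellinger_sq_product(h2, n)
print(h2, closed_form, h2_product)
```

Only the coordinate where the two vertices differ contributes to the sum, which is why the result does not depend on $N$.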

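Let $\sigma \in \Omega$ and $\hat f_n$ be an estimator with values in $[-1,1]$ (according to (12), we consider
only estimators in $[-1,1]$). Using MA($\kappa$), we have, conditionally on the observations $D_n$
and for $\pi = \pi_\sigma$,

$$
A(\hat f_n) - A^* \ge (cw)^\kappa\Big(\sum_{j=1}^{N-1}|\hat f_n(x_j) - \sigma_j|\Big)^\kappa.
$$

Taking here the expectations, we find $E_{\pi_\sigma}[A(\hat f_n) - A^*] \ge (cw)^\kappa E_{\pi_\sigma}\big[\big(\sum_{j=1}^{N-1}|\hat f_n(x_j) -
\sigma_j|\big)^\kappa\big]$. Using Jensen's inequality and Lemma 6, we obtain

$$
\inf_{\hat f_n}\sup_{\sigma\in\Omega}\big(E_{\pi_\sigma}[A(\hat f_n) - A^*]\big) \ge (cw)^\kappa\Big(\frac{N-1}{4e^2}\Big)^\kappa. \tag{26}
$$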



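Now take $w = (nh^2)^{-1}$, $N = [\log M/\log 2]$ and $h = (n^{-1}[\log M/\log 2])^{(\kappa-1)/(2\kappa-1)}$.
Replace $w$ and $N$ in (26) by these values. Thus, from (25), there exist $f_1,\dots,f_M$ (the first
$2^{N-1}$ are $\mathrm{sign}(2\eta_\sigma - 1)$ for $\sigma \in \Omega$ and any choice is allowed for the remaining $M - 2^{N-1}$)
such that, for any procedure $\hat f_n$, there exists a probability measure $\pi$ satisfying MA($\kappa$),
such that

$$
E[A(\hat f_n) - A^*] - (1+a)\min_{f\in\mathrm{Conv}(\mathcal{F})}(A(f) - A^*) \ge C_0\Big(\frac{\log M}{n}\Big)^{\kappa/(2\kappa-1)},
$$

where $C_0 = c^\kappa(4e)^{-1}2^{-2\kappa(\kappa-1)/(2\kappa-1)}(\log 2)^{-\kappa/(2\kappa-1)}$.

Moreover, according to Lemma 3, we have

$$
a\min_{f\in\mathcal{C}}(A(f) - A^*) + \frac{C_0}{2}\Big(\frac{\log M}{n}\Big)^{\kappa/(2\kappa-1)} \ge \sqrt{a^{1/\kappa}C_0/2}\;\sqrt{\frac{\big(\min_{f\in\mathcal{C}}A(f) - A^*\big)^{1/\kappa}\log M}{n}}.
$$

Thus,

$$
E[A(\hat f_n) - A^*] \ge \min_{f\in\mathcal{C}}(A(f) - A^*) + \frac{C_0}{2}\Big(\frac{\log M}{n}\Big)^{\kappa/(2\kappa-1)} + \sqrt{a^{1/\kappa}C_0/2}\;\sqrt{\frac{\big(\min_{f\in\mathcal{C}}A(f) - A^*\big)^{1/\kappa}\log M}{n}}.
$$

For $\kappa = 1$, we take $h = 1/2$. Then, $|2\eta_\sigma(X) - 1| \ge 1/2$ a.s., so $\pi_\sigma \in$ MA(1). It then
suffices to take $w = 4/n$ and $N = [\log M/\log 2]$ to obtain the result.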
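The parameter choice for $\kappa > 1$ can be sanity-checked numerically. The following sketch assumes that the bracket $[\log M/\log 2]$ denotes the integer part, and the function name and the sample values of `n`, `M` and `kappa` are hypothetical. It verifies the three constraints used in the proof ($2^{N-1} \le M$, $(N-1)w \le h^{1/(\kappa-1)}$ and $w(1-\sqrt{1-h^2}) \le n^{-1}$) and compares $(w(N-1))^\kappa$, that is, the right-hand side of (26) without its constants, with the rate $((\log M)/n)^{\kappa/(2\kappa-1)}$.

```python
import math

def lower_bound_parameters(n, M, kappa):
    """Return (N, h, w) as chosen in the proof, for kappa > 1.

    Assumption: [log M / log 2] is read as the integer part.
    """
    N = math.floor(math.log(M) / math.log(2))
    h = (N / n) ** ((kappa - 1.0) / (2.0 * kappa - 1.0))
    w = 1.0 / (n * h * h)
    return N, h, w

n, M, kappa = 10_000, 1_000, 2.0   # illustrative values only
N, h, w = lower_bound_parameters(n, M, kappa)

cube_fits = 2 ** (N - 1) <= M                                  # 2^(N-1) <= M
margin_ok = (N - 1) * w <= h ** (1.0 / (kappa - 1.0))          # pi_sigma is in P_kappa
hellinger_ok = w * (1.0 - math.sqrt(1.0 - h * h)) <= 1.0 / n   # Hellinger condition

rate = (math.log(M) / n) ** (kappa / (2.0 * kappa - 1.0))
scale = (w * (N - 1)) ** kappa     # (26) without the constants c and 4e^2
print(N, h, w, cube_fits, margin_ok, hellinger_ok, scale / rate)
```

The ratio `scale / rate` stays of order one here, which is the point of balancing $w$ and $h$ in this way.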
Proof of Corollary 1. The result follows from Theorems 1 and 2. Using inequality (3),
Lemma 3 and the fact that for any prediction rule $f$, we have $A(f) - A^* = 2(R(f) - R^*)$,
for any $a > 0$, with $t = a(A_{\mathcal{C}} - A^*)$ and $v = (C^2(\log M)/n)^{\kappa/(2\kappa-1)}a^{-1/(2\kappa-1)}$, we obtain
the result. □

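Proof of Theorem 3. Denote by $\tilde f_n$ the ERM aggregate over $\mathcal{F}$. Let $\epsilon > 0$. Denote by
$\mathcal{F}_\epsilon$ the set $\{f \in \mathcal{F} : R(f) > R_{\mathcal{F}} + 2\epsilon\}$, where $R_{\mathcal{F}} = \min_{f\in\mathcal{F}} R(f)$.
Let $x > 0$. If

$$
\sup_{f\in\mathcal{F}}\frac{R(f) - R^* - (R_n(f) - R_n(f^*))}{R(f) - R^* + x} \le \frac{\epsilon}{R_{\mathcal{F}} - R^* + 2\epsilon + x},
$$

then the same argument as in Theorem 1 yields $R_n(f) - R_n(f^*) \ge R_{\mathcal{F}} - R^* + \epsilon$ for any
$f \in \mathcal{F}_\epsilon$. So, we have

$$
P\Big[\inf_{f\in\mathcal{F}_\epsilon} R_n(f) - R_n(f^*) < R_{\mathcal{F}} - R^* + \epsilon\Big]
\le P\Big[\sup_{f\in\mathcal{F}}\frac{R(f) - R^* - (R_n(f) - R_n(f^*))}{R(f) - R^* + x} > \frac{\epsilon}{R_{\mathcal{F}} - R^* + 2\epsilon + x}\Big].
$$

We consider $f' \in \mathcal{F}$ such that $\min_{f\in\mathcal{F}} R(f) = R(f')$. If $R(\tilde f_n) > R_{\mathcal{F}} + 2\epsilon$, then $\tilde f_n \in \mathcal{F}_\epsilon$,
so there exists $g \in \mathcal{F}_\epsilon$ such that $R_n(g) \le R_n(f')$. Hence, using the same argument as in
Theorem 1, we obtain

$$
P[R(\tilde f_n) > R_{\mathcal{F}} + 2\epsilon] \le P\Big[\sup_{f\in\mathcal{F}}\frac{R(f) - R^* - (R_n(f) - R_n(f^*))}{R(f) - R^* + x} \ge \frac{\epsilon}{R_{\mathcal{F}} - R^* + 2\epsilon + x}\Big]
+ P[R_n(f') - R_n(f^*) > R_{\mathcal{F}} - R^* + \epsilon].
$$

We complete the proof by using Lemma 5, the fact that for any $f$ from $\mathcal{X}$ to $\{-1,1\}$,
we have $2(R(f) - R^*) = A(f) - A^*$, and the same arguments as those developed at the
end of the proof of Theorem 1. □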

Proof of Theorem 4. Using the same argument as the one used in the beginning of
the proof of Theorem 2, we have, for all prediction rules $f_1,\dots,f_M$ and $a > 0$,

$$
\sup_{g_1,\dots,g_M}\inf_{\hat f_n}\sup_{\pi\in\mathcal{P}_\kappa}\Big(E[R(\hat f_n) - R^*] - (1+a)\min_{j=1,\dots,M}(R(g_j) - R^*)\Big)
\ge \inf_{\hat f_n}\sup_{\substack{\pi\in\mathcal{P}_\kappa \\ f^*\in\{f_1,\dots,f_M\}}} E[R(\hat f_n) - R^*].
$$

Consider the set of probability measures $\{\pi_\sigma, \sigma \in \Omega\}$ introduced in the proof of Theorem
2. Assume that $\kappa > 1$. Since for any $\sigma \in \Omega$ and any classifier $\hat f_n$, we have, by using MA($\kappa$),

$$
E_{\pi_\sigma}[R(\hat f_n) - R^*] \ge (cw)^\kappa E_{\pi_\sigma}\Big[\Big(\sum_{j=1}^{N-1}|\hat f_n(x_j) - \sigma_j|\Big)^\kappa\Big],
$$

using Jensen's inequality and Lemma 6, we obtain

$$
\inf_{\hat f_n}\sup_{\sigma\in\Omega}\big(E_{\pi_\sigma}[R(\hat f_n) - R^*]\big) \ge (cw)^\kappa\Big(\frac{N-1}{4e^2}\Big)^\kappa.
$$

By taking $w = (nh^2)^{-1}$, $N = [\log M/\log 2]$ and $h = (n^{-1}[\log M/\log 2])^{(\kappa-1)/(2\kappa-1)}$,
there exist $f_1,\dots,f_M$ (the first $2^{N-1}$ are $\mathrm{sign}(2\eta_\sigma - 1)$ for $\sigma \in \Omega$ and any choice is allowed
for the remaining $M - 2^{N-1}$) such that for any procedure $\hat f_n$, there exists a probability
measure $\pi$ satisfying MA($\kappa$), such that

$$
E[R(\hat f_n) - R^*] - (1+a)\min_{j=1,\dots,M}(R(f_j) - R^*) \ge C_0\Big(\frac{\log M}{n}\Big)^{\kappa/(2\kappa-1)},
$$

where $C_0 = c^\kappa(4e)^{-1}2^{-2\kappa(\kappa-1)/(2\kappa-1)}(\log 2)^{-\kappa/(2\kappa-1)}$. Moreover, according
to Lemma 3, we have

$$
a\min_{f\in\mathcal{F}}(R(f) - R^*) + \frac{C_0}{2}\Big(\frac{\log M}{n}\Big)^{\kappa/(2\kappa-1)} \ge \sqrt{a^{1/\kappa}C_0/2}\;\sqrt{\frac{\big(\min_{f\in\mathcal{F}}R(f) - R^*\big)^{1/\kappa}\log M}{n}}.
$$

The case $\kappa = 1$ is treated in the same way as in the proof of Theorem 2.
Lemma 1. Let $\alpha \ge 1$ and $a, b > 0$. An integration by parts yields

$$
\int_a^{+\infty}\exp(-bt^\alpha)\,dt \le \frac{\exp(-ba^\alpha)}{\alpha b a^{\alpha-1}}.
$$
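As a quick numerical illustration of Lemma 1, the sketch below compares a midpoint-rule approximation of the tail integral with the stated bound. The helper names, the truncation point of the integral and the tested triples $(a, b, \alpha)$ are arbitrary choices made for this example; for $\alpha = 1$ the bound is attained with equality, so a small tolerance is used in the comparison.

```python
import math

def tail_integral(a, b, alpha, width=50.0, steps=200_000):
    """Midpoint-rule approximation of the integral of exp(-b * t**alpha) over [a, a + width].

    The integrand is negligible beyond a + width for the values tested below.
    """
    dt = width / steps
    return sum(math.exp(-b * (a + (i + 0.5) * dt) ** alpha) for i in range(steps)) * dt

def lemma1_bound(a, b, alpha):
    """Right-hand side of Lemma 1: exp(-b a^alpha) / (alpha * b * a^(alpha - 1))."""
    return math.exp(-b * a ** alpha) / (alpha * b * a ** (alpha - 1.0))

cases = [(1.0, 1.0, 1.0), (1.0, 2.0, 1.5), (0.5, 3.0, 2.0)]   # (a, b, alpha)
results = [(tail_integral(a, b, al), lemma1_bound(a, b, al)) for a, b, al in cases]
for (a, b, al), (integral, bound) in zip(cases, results):
    print(f"a={a}, b={b}, alpha={al}: integral={integral:.6g} <= bound={bound:.6g}")
```

The bound follows from inserting the factor $(t/a)^{\alpha-1} \ge 1$ under the integral, which makes the integrand an exact derivative.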



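Lemma 2. Let $b_1,\dots,b_M$ be $M$ positive numbers and $a_1,\dots,a_M$ some numbers. We
have

$$
\min_{j=1,\dots,M}\Big(\frac{a_j}{b_j}\Big) \le \frac{\sum_{j=1}^{M}a_j}{\sum_{j=1}^{M}b_j} \le \max_{j=1,\dots,M}\Big(\frac{a_j}{b_j}\Big).
$$

Proof.

$$
\sum_{j=1}^{M}b_j\min_{k=1,\dots,M}\Big(\frac{a_k}{b_k}\Big) \le \sum_{j=1}^{M}a_j \le \sum_{j=1}^{M}b_j\max_{k=1,\dots,M}\Big(\frac{a_k}{b_k}\Big).
$$

□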
Lemma 3. Let $v, t > 0$ and $\kappa \ge 1$. The concavity of the logarithm yields

$$
t + v \ge t^{1/(2\kappa)}v^{(2\kappa-1)/(2\kappa)}.
$$
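Lemma 3 is a weighted arithmetic-geometric mean inequality with weights $1/(2\kappa)$ and $(2\kappa-1)/(2\kappa)$. A minimal numerical check over a small grid is sketched below; the function name and the grid values are hypothetical and chosen only for illustration.

```python
import itertools

def lemma3_gap(t, v, kappa):
    """Return t + v - t^(1/(2 kappa)) * v^((2 kappa - 1)/(2 kappa)); Lemma 3 says this is >= 0."""
    exponent_t = 1.0 / (2.0 * kappa)
    exponent_v = (2.0 * kappa - 1.0) / (2.0 * kappa)
    return t + v - (t ** exponent_t) * (v ** exponent_v)

grid = [1e-3, 0.1, 1.0, 7.5, 100.0]      # values of t and v
kappas = [1.0, 1.5, 2.0, 10.0]
min_gap = min(lemma3_gap(t, v, k) for t, v, k in itertools.product(grid, grid, kappas))
print(min_gap)
```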

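Lemma 4. Let $f$ be a function from $\mathcal{X}$ to $[-1,1]$ and $\pi$ a probability measure on $\mathcal{X}\times
\{-1,1\}$ satisfying MA($\kappa$) for some $\kappa \ge 1$. Denote by $V$ the symbol of variance. We have

$$
V\big(Y(f(X) - f^*(X))\big) \le c(A(f) - A^*)^{1/\kappa}
$$

and

$$
V\big(1_{Yf(X)\le 0} - 1_{Yf^*(X)\le 0}\big) \le c(R(f) - R^*)^{1/\kappa}.
$$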



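Lemma 5. Let $\mathcal{F} = \{f_1,\dots,f_M\}$ be a finite set of functions from $\mathcal{X}$ to $[-1,1]$. Assume
that $\pi$ satisfies MA($\kappa$) for some $\kappa \ge 1$. We have, for any positive numbers $t, x$ and any
integer $n$,

$$
P\Big[\max_{f\in\mathcal{F}}Z_x(f) > t\Big] \le M\Big(\Big(1 + \frac{8cx^{1/\kappa}}{n(tx)^2}\Big)\exp\Big(-\frac{n(tx)^2}{8cx^{1/\kappa}}\Big) + \Big(1 + \frac{16}{3ntx}\Big)\exp\Big(-\frac{3ntx}{16}\Big)\Big),
$$

where the constant $c > 0$ appears in MAH($\kappa$) and

$$
Z_x(f) = \frac{A(f) - A_n(f) - (A(f^*) - A_n(f^*))}{A(f) - A^* + x}.
$$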



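Proof. For any integer $j$, consider the set $\mathcal{F}_j = \{f \in \mathcal{F} : jx \le A(f) - A^* < (j+1)x\}$.
Using Bernstein's inequality, Proposition 1 and Lemma 4 to upper bound the variance
term, we obtain

$$
\begin{aligned}
P\Big[\max_{f\in\mathcal{F}}Z_x(f) > t\Big]
&\le \sum_{j=0}^{+\infty}P\Big[\max_{f\in\mathcal{F}_j}Z_x(f) > t\Big] \\
&\le \sum_{j=0}^{+\infty}P\Big[\max_{f\in\mathcal{F}_j}A(f) - A_n(f) - (A(f^*) - A_n(f^*)) > t(j+1)x\Big] \\
&\le M\sum_{j=0}^{+\infty}\exp\Big(-\frac{n[t(j+1)x]^2}{4c((j+1)x)^{1/\kappa} + (8/3)t(j+1)x}\Big) \\
&\le M\sum_{j=0}^{+\infty}\Big(\exp\Big(-\frac{n(tx)^2(j+1)^{2-1/\kappa}}{8cx^{1/\kappa}}\Big) + \exp\Big(-(j+1)\frac{3ntx}{16}\Big)\Big) \\
&\le M\Big(\exp\Big(-\frac{nt^2x^{2-1/\kappa}}{8c}\Big) + \exp\Big(-\frac{3ntx}{16}\Big)\Big) \\
&\quad + M\int_1^{+\infty}\Big(\exp\Big(-\frac{nt^2x^{2-1/\kappa}}{8c}u^{2-1/\kappa}\Big) + \exp\Big(-\frac{3ntx}{16}u\Big)\Big)\,du.
\end{aligned}
$$

Lemma 1 leads to the result.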



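Lemma 6. Let $\{P_\omega / \omega \in \Omega\}$ be a set of probability measures on a measurable space
$(\mathcal{X},\mathcal{A})$, indexed by the cube $\Omega = \{0,1\}^m$. Denote by $E_\omega$ the expectation under $P_\omega$ and
by $\rho$ the Hamming distance on $\Omega$. Assume that

$$
\forall \omega,\omega' \in \Omega / \rho(\omega,\omega') = 1, \quad H^2(P_\omega,P_{\omega'}) \le \alpha < 2.
$$

Then,

$$
\inf_{\hat w\in[0,1]^m}\max_{\omega\in\Omega}E_\omega\Big[\sum_{j=1}^{m}|\hat w_j - \omega_j|\Big] \ge \frac{m}{4}\Big(1 - \frac{\alpha}{2}\Big)^2.
$$

Proof. Obviously, we can replace $\inf_{\hat w\in[0,1]^m}$ by $(1/2)\inf_{\hat w\in\{0,1\}^m}$ since for all $\omega \in \{0,1\}$
and $\hat w \in [0,1]$, there exists $\tilde w \in \{0,1\}$ (e.g., the projection of $\hat w$ onto $\{0,1\}$) such that
$|\hat w - \omega| \ge (1/2)|\tilde w - \omega|$. We then use Theorem 2.10 of Tsybakov [33], page 103. □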
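The projection step in this proof can be illustrated coordinate by coordinate. In the sketch below, `project` is a hypothetical name for rounding a value in $[0,1]$ to the nearest point of $\{0,1\}$ (ties sent to 1, an arbitrary choice), and the loop checks on a finite grid that the distance to any vertex value is at least half the distance of the rounded value.

```python
def project(w_hat):
    """Round a value in [0, 1] to the nearest element of {0, 1} (ties go to 1)."""
    return 1 if w_hat >= 0.5 else 0

def projection_inequality_holds(steps=1000):
    """Check |w_hat - omega| >= (1/2) |project(w_hat) - omega| on a grid of [0, 1]."""
    for omega in (0, 1):
        for i in range(steps + 1):
            w_hat = i / steps
            w_tilde = project(w_hat)
            if abs(w_hat - omega) < 0.5 * abs(w_tilde - omega):
                return False
    return True

print(projection_inequality_holds())
```

When the rounded value equals $\omega$ the right-hand side is zero; otherwise the original value is at distance at least $1/2$ from $\omega$, which is exactly the factor lost in the lower bound.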